Projection pursuit for discrete data

Persi Diaconis and Julia Salzman

Stanford University 

Abstract: This paper develops projection pursuit for discrete data using the 
discrete Radon transform. Discrete projection pursuit is presented as an ex- 
ploratory method for finding informative low dimensional views of data such 
as binary vectors, rankings, phylogenetic trees or graphs. We show that for 
most data sets, most projections are close to uniform. Thus, informative sum- 
maries are ones deviating from uniformity. Syllabic data from several of Plato's 
great works is used to illustrate the methods. Along with some basic distribu- 
tion theory, an automated procedure for computing informative projections is 
introduced. 



1. Introduction 

Projection pursuit is an exploratory graphical tool for picturing high dimensional 
data through low dimensional projections. Introduced by Kruskal [35], [36], and 
developed by Friedman and Tukey [28], the idea is to have the computer select 
a small family of projections by numerically optimizing an index of "interest". 
The original projection indices were ad hoc. In joint work with David Freedman 
[15], it was shown that for most data sets, most projections are about the same: 
approximately Gaussian. Therefore, the interesting projections, the ones which were 
special for this data set, are projections that are far from Gaussian. 

Peter Huber [32] found his own version of this: projections are uninformative if 
they are unstructured or "random". Thus projections with high entropy are 
uninformative. For a fixed scale, distributions having high entropy are approximately 
Gaussian. Huber also showed that the Friedman-Tukey index is a measure of 
non-Gaussianity. 

The purpose of the present paper is to give a parallel development for data in 
discrete spaces: collections of binary vectors, rankings, phylogenetic trees or sets 
of graphs. We develop a notion of projection as a partition of the discrete data 
into blocks. We show that for most data sets, most projections are close to 
uniformly partitioned. This suggests that the informative summaries are the ones 
with splits that are far from uniform. 

The outline of the paper is as follows. Definitions and first examples are given 
in Section 2. The ideas lean on classical developments in block designs and give 
new applications for that theory. A discrete version of the Radon transform along 
with an inversion theory is presented, determining when a collection of projections 
loses information. Section 3 gives a data analytic example in some detail. The 
data arise from the problem of putting some of Plato's works in chronological 
order. Here, discrete projection pursuit leads to the discovery of a striking, easily 
interpretable structure that does not appear in other analyses of this data (e.g. 
Ahn et al. [1], Cox and Brandwood [11], Charnomordic and Holmes [8], Wishart 
and Leach [49]). Section 4 proves that for most data sets, most partitions lead 
to approximately uniform projections. This leads directly to a usable criterion: a 
projection is interesting if it is far from uniform. The distance to uniformity can be 
measured by any distance between probabilities, and we consider the well-known 
total variation, Hellinger and Vasserstein metrics. 

The final section gives results for the least uniform projection. Theorem 5.1 
shows that if the class of projections is not too rich, for example, the affine 
hyperplanes in $Z_2^k$, then for most data sets even the least uniform partition is close 
to uniform. If the class of projections contains many sets, then least uniform 
projections are "structured". The final theorem attacks the problem of a data analyst 
finding "structure" in "noise". Computational details for computing the metrics 
and automating the analysis are in an Appendix. 

There has been extensive development of projection pursuit for density 
estimation (Friedman et al. [26]), regression (Friedman and Stuetzle [27], Hall [30]), 
applications to time series (Donoho [22]), discriminant analysis (Posse [42], Polzehl 
[41]) and standard multivariate problems such as covariance estimation (Hwang et 
al. [33]). This has led to a healthy development captured in the modern 
implementations (Xgobi, Ggobi). Online documentation for this software is an instructive 
catalog. We have not attempted to develop our ideas in these directions, but the 
beginning steps of ridge functions will be found below. 

2. Projections and Radon transforms 

This section introduces our notation and set up for working with discrete data. 
It defines projection bases and the discrete Radon transform, and gives examples with 
binary data and permutation data. Analysis will be performed on binary n-tuple 
data from several works of Plato. Let $X$ be a finite set. Let $\mathcal{Y}$ be a class of subsets 
of $X$. Let $f : X \to \mathbb{R}$ be a function. The Radon transform of $f$ at $y \in \mathcal{Y}$ is defined 
by 

(1) $\bar f(y) = \sum_{x \in y} f(x).$ 

The class $\mathcal{Y}$ is called a projection base if: 

(2) $|y|$ is constant for $y \in \mathcal{Y}$ ($|y|$ denotes the cardinality of $y$). 

(3) There is a partition $p_1, \dots, p_j$ of $\mathcal{Y}$ such that each $p_i$ is a partition of $X$. 

For a partition $p$, the numbers $\{\bar f(y)\}_{y \in p}$ will be called the projection of $f$ in 
direction $p$. The sets in $\mathcal{Y}$ may be thought of as "lines" in a geometry. If lines in 
the same partition are called parallel, then (3) corresponds to the Euclidean axiom: 
for every point $x \in X$ and every line $y \in \mathcal{Y}$, there is a unique line $y^*$ parallel to $y$ 
such that $x \in y^*$. In the statistics literature, designs with property (3) are called 
"resolvable" (see Hedayat et al. [31] or Constantine [10] for examples). Assumption 
(2) guarantees that projections are based on averages over comparable sets. 
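
The definitions above can be made concrete with a small sketch. The function names (`radon_transform`, `is_projection_base`) and the toy data are illustrative assumptions, not part of the paper; the blocks are taken to be the sets of binary 3-tuples with one coordinate fixed, and they are assumed to arrive already grouped into candidate partitions.

```python
from itertools import product


def radon_transform(f, blocks):
    """Sum f over each block y, as in definition (1)."""
    return {y: sum(f[x] for x in y) for y in blocks}


def is_projection_base(points, partitions):
    """Check (2) and (3) for blocks already grouped into candidate partitions."""
    block_sizes = {len(y) for p in partitions for y in p}
    if len(block_sizes) != 1:          # condition (2): all blocks have the same size
        return False
    for p in partitions:               # condition (3): each group partitions X
        covered = [x for y in p for x in y]
        if sorted(covered) != sorted(points):
            return False
    return True


# Toy instance: X = binary 3-tuples, blocks fix one coordinate at 0 or 1.
points = list(product((0, 1), repeat=3))
partitions = [
    [frozenset(x for x in points if x[i] == a) for a in (0, 1)]
    for i in range(3)
]
blocks = [y for p in partitions for y in p]

f = {x: 1 / len(points) for x in points}   # uniform f, chosen only for illustration
fbar = radon_transform(f, blocks)          # every block sums to one half
```

For a uniform $f$ every projection is flat, which is the baseline against which interesting projections are judged later in the paper.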

Consider the following examples: 
Table 1

Percentage distribution of sentence endings

Type of ending   Rep.   Laws   Phil.   Pol.   Soph.   Tim.
u u u u u         1.1    2.4    2.5     1.7    2.8     2.4
- u u u u         1.6    3.8    2.8     2.5    3.6     3.9
u - u u u         1.7    1.9    2.1     3.1    3.4     6.0
u u - u u         1.9    2.6    2.6     2.6    2.6     1.8
u u u - u         2.1    3.0    4.0     3.3    2.4     3.4
u u u u -         2.0    3.8    4.8     2.9    2.5     3.5
- - u u u         2.1    2.7    4.3     3.3    3.3     3.4
- u - u u         2.1    1.8    1.5     2.3    4.0     3.4
- u u - u         2.8    0.6    0.7     0.4    2.1     1.7
- u u u -         4.6    8.8    6.5     4.0    2.3     3.3
u - - u u         3.3    3.4    6.7     5.3    3.3     3.4
u - u - u         2.6    1.0    0.6     0.9    1.6     3.2
u - u u -         4.6    1.1    0.7     1.0    3.0     2.7
u u - - u         2.6    1.5    3.1     3.1    3.0     3.0
u u - u -         4.4    3.0    1.9     3.0    3.0     2.2
u u u - -         2.5    5.7    5.4     4.4    5.1     3.9
- - - u u         2.9    4.2    5.5     6.9    5.2     3.0
- - u - u         3.0    1.4    0.7     2.7    2.6     3.3
- - u u -         3.4    1.0    0.4     0.7    2.3     3.3
- u - - u         2.0    2.3    1.2     3.4    3.7     3.3
- u - u -         6.4    2.4    2.8     1.8    2.1     3.0
- u u - -         4.2    0.6    0.7     0.8    3.0     2.8
u u - - -         2.8    2.9    2.6     4.6    3.4     3.0
u - u - -         4.2    1.2    1.3     1.0    1.3     3.3
u - - u -         4.8    8.2    5.3     4.5    4.6     3.0
u - - - u         2.4    1.9    5.3     2.5    2.5     2.2
u - - - -         3.5    4.1    3.3     3.8    2.9     2.4
- u - - -         4.0    3.7    3.3     4.9    3.5     3.0
- - u - -         4.1    2.1    2.3     2.1    4.1     6.4
- - - u -         4.1    8.8    9.0     6.8    4.7     3.8
- - - - u         2.0    3.0    2.9     2.9    2.6     2.2
- - - - -         4.2    5.2    4.0     4.9    3.4     1.8
no. sentences    3778   3783    958     770    919     762




Example 2.1. $X = Z_2^k$, the set of binary $k$-tuples. Here is a concrete example 
of a data set with this structure: L. Brandwood classified each sentence of Plato's 
Republic according to its last five syllables. These can run from all short (U) through 
all long (-). Identifying U with 1 and - with 0, each sentence is associated with a 
binary 5-tuple. As $x$ ranges over $Z_2^5$, let $f(x)$ denote the proportion of sentences 
with ending $x$. The values of $f(x)$ are given in the first column of Table 1. 

A second example of data with this structure is the result of grading 
correct/incorrect in a test with $k$ questions. There are several useful choices of $\mathcal{Y}$, 
given next. 

2.1. Projections for data in $Z_2^k$ 

2.1.1. Marginal projections in $Z_2^k$ 

For $i = 1, 2, \dots, k$, let $y_i^0 = \{x \in Z_2^k : x_i = 0\}$ and let $y_i^1 = \{x \in Z_2^k : x_i = 1\}$. The 
sets $\mathcal{Y} = \{y_i^j\}$, $1 \le i \le k$, $j \in \{0, 1\}$, form a projection base. In the Plato example, 
the projections have a simple interpretation as the proportion of sentences with a 
specific syllable in the $i$th place. Displaying projections offers no problem here; a 
single number suffices. 

A second natural choice of $\mathcal{Y}$ gives second order margins. This is based on sets 
$y_{ij}^{ab} = \{x \in Z_2^k : x_i = a, x_j = b\}$, $1 \le i < j \le k$, $a, b \in \{0, 1\}$. In this case, a 
projection consists of 4 numbers. In the Plato example, the projection along 
coordinates $i, j$ gives the proportion of sentences with each of the 4 possible patterns 
U U, U -, - U, - - in positions $i, j$. Table 3 in Section 3 is an example of one method 
to display such projections. Section 3 contains an analysis of the data in Table 1 
based on these projections. The analysis gives a clear interpretation to a classical 
way of dating the books of Plato. The analysis is independent of the other examples 
in this section and can be read at this time. 

2.1.2. Affine hyperplanes in $Z_2^k$ 

This is one natural way of "filling out" the marginal projections presented above. 
For $z \in Z_2^k$ and $a \in \{0, 1\}$, let $y_z^a = \{x \in Z_2^k : x \cdot z = a \bmod 2\}$. The collection 
$\mathcal{Y} = \{y_z^a\}_{z \in Z_2^k,\ a \in \{0,1\}}$ forms a projection base. Observe that when $z$ has a 1 in 
position $i$ and zeros elsewhere, $y_z^0$ equals the $y_i^0$ of the previous example. The sets 
in $\mathcal{Y}$ are the affine hyperplanes in $Z_2^k$. Similarly, the affine planes of any dimension 
form a projection base. An analysis of the Plato data using all affine hyperplanes 
is in Appendix A.3 below. 

Here are some examples to show how the structure of $f$ is reflected in $\bar f$. If 
$f(x) = \delta_{x, x_0}$, then $\bar f(y) = 1$ if $x_0 \in y$ and zero otherwise. If $f(x) = 1/2^k$ for all $x$, then 
$\bar f(y) = 1/2$ and hence is constant for all $y$. As a final example, consider a fixed, 
non-zero vector $y^* \in Z_2^k$. Let $S$ be the hyperplane determined by $y^*$: 
$S = \{x \in Z_2^k : x \cdot y^* = 0 \bmod 2\}$. Let 

$f(x) = \begin{cases} 1/2^{k-1} & \text{if } x \in S, \\ 0 & \text{otherwise.} \end{cases}$ 

An easy computation shows 

$\bar f(y_z^0) = \begin{cases} 1 & \text{if } z = y^*, \\ 1/2 & \text{otherwise,} \end{cases} \qquad \bar f(y_z^1) = \begin{cases} 0 & \text{if } z = y^*, \\ 1/2 & \text{otherwise.} \end{cases}$ 

The hyperplane transform is essentially the same as the ordinary Fourier 
transform on the group $Z_2^k$. This is defined by 

$\hat f(z) = \sum_{x} (-1)^{x \cdot z} f(x).$ 

If $f$ is a probability on $Z_2^k$, $\hat f(z) = 2 \bar f(y_z^0) - 1$. The transform $\hat f$ has been widely 
used for data analysis of this type of data. See Solomon [44] or Diaconis [18, 19], 
Chapter 11. The discrete Radon transform with projections onto affine hyperplanes 
is also used by Ahn et al. [1]. 

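
The relation between the hyperplane transform and the Fourier transform on $Z_2^k$ can be checked numerically. The sketch below is only an illustration: the helper names (`dot2`, `hyperplane_sum`, `fourier`), the choice $k = 3$, the weights and the vector `y_star` are all assumptions made for the example, and only nonzero $z$ are considered so that every set is a genuine hyperplane.

```python
from fractions import Fraction
from itertools import product


def dot2(x, z):
    """Inner product of two binary tuples, reduced mod 2."""
    return sum(xi * zi for xi, zi in zip(x, z)) % 2


def hyperplane_sum(f, z, a):
    """Radon transform of f at the affine hyperplane {x : x.z = a mod 2}."""
    return sum(fx for x, fx in f.items() if dot2(x, z) == a)


def fourier(f, z):
    """Fourier transform of f on Z_2^k at z."""
    return sum((-1) ** dot2(x, z) * fx for x, fx in f.items())


k = 3
points = list(product((0, 1), repeat=k))
nonzero = [z for z in points if any(z)]

# An arbitrary probability on Z_2^3 (weights are made up for illustration).
weights = range(1, len(points) + 1)
f = {x: Fraction(w, sum(weights)) for x, w in zip(points, weights)}
identity_holds = all(
    fourier(f, z) == 2 * hyperplane_sum(f, z, 0) - 1 for z in nonzero
)

# Uniform probability on the hyperplane S determined by y_star.
y_star = (1, 0, 1)
g = {
    x: Fraction(1, 2 ** (k - 1)) if dot2(x, y_star) == 0 else Fraction(0)
    for x in points
}
g_bar0 = {z: hyperplane_sum(g, z, 0) for z in nonzero}
```

Exact rational arithmetic is used so that the identity can be tested with `==` rather than a tolerance.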
2.2. Projections for data in $X = S_n$, the set of permutations of $n$ 
letters. 

Permutation data arises in taste testing, ranking and elections; for example, in 
presidential elections of the American Psychological Association, members are asked to 
rank order 5 candidates. Here, for a permutation $\pi$, $f(\pi)$ is taken as the proportion 
of voters choosing the order $\pi$. For background and many examples, see Critchlow 
[12], Fligner and Verducci [25] or Marden [39]. 

2.2.1. Partitions based on marginal projections of permutations in $S_n$. 

Let $y_{ij} = \{\pi \in S_n : \pi(i) = j\}$, $1 \le i, j \le n$. These sets form a projection base. For 
fixed $i$, the sets $y_{i1}, y_{i2}, \dots, y_{in}$ form a partition $p(i)$. The projection in direction 
$p(i)$ has a natural interpretation in the example: how did people rank candidate $i$? 
The projection can be displayed by making a histogram. 
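
As a small illustration of these first order margins, the sketch below tabulates the projection in every direction $p(i)$ at once. The ballots are hypothetical numbers invented for the example, `marginal_projection` is an illustrative name, and candidates and ranks are indexed from 0 for convenience.

```python
from fractions import Fraction
from itertools import permutations


def marginal_projection(f, n):
    """Return M with M[i][j] = total weight of permutations pi with pi(i) = j."""
    M = [[Fraction(0)] * n for _ in range(n)]
    for pi, weight in f.items():
        for i, j in enumerate(pi):
            M[i][j] += weight
    return M


# Hypothetical ballots on 3 candidates: pi[i] is the rank given to candidate i.
n = 3
counts = {(0, 1, 2): 5, (1, 0, 2): 3, (2, 1, 0): 2}
total = sum(counts.values())
f = {pi: Fraction(counts.get(pi, 0), total) for pi in permutations(range(n))}

M = marginal_projection(f, n)   # row i is the histogram for candidate i
```

Each row of `M` is one projection (the histogram for a single candidate), and both rows and columns sum to 1 because every ballot assigns each rank exactly once.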

A second useful choice of $\mathcal{Y}$ is based on considering two positions: 
$y_{ij}^{kl} = \{\pi \in S_n : \pi(i) = k, \pi(j) = l\}$ with $i \ne j$, $k \ne l$. This leads to projections giving the 
joint rankings of a fixed pair of candidates in the example. Such projections can 
be displayed by making a 2-dimensional picture and gray scaling the $(k, l)$ square 
to correspond to the proportion of voters ranking the pair of candidates in order 
$k, l$. Similarly third and higher order projections can be defined. 

2.2.2. Partitions based on subgroups of $S_n$. 

When $X$ is a group such as $S_n$, the following constructions for $\mathcal{Y}$ are available. Let 
$N$ be a subgroup of $X$. The orbits of $N$ acting on $X$ are the cosets $\{Ny\}_{y \in X}$, and 
the distinct orbits partition $X$. Varying $N$ by conjugation, $\{yNy^{-1}\}_{y \in X}$, gives a 
projection base for $X$. 

When $N$ is taken as $S_{n-1} = \{\pi \in S_n : \pi(1) = 1\}$ the projections are the marginal 
projections defined above. Taking $N$ as $S_{n-2} = \{\pi \in S_n : \pi(1) = 1, \pi(2) = 2\}$ 
gives the second order margins. An important class of subgroups are the so-called 
Young subgroups: let $\lambda_1 \le \lambda_2 \le \dots \le \lambda_n$ be a partition of $n$, so $\sum_i \lambda_i = n$. 
Let $S_{\lambda_1} \times S_{\lambda_2} \times \dots \times S_{\lambda_n}$ be the permutations that permute the first $\lambda_1$ elements 
among themselves and the next $\lambda_2$ elements among themselves, etc. These include 
the previous examples and provide enough transforms for an inversion theory, as 
will be shown below. Display of such projections is not a well studied problem. In 
the case of a projection corresponding to a Young subgroup, one suggestion is a 
1-dimensional histogram using one of the orderings suggested in Chapter 3 of James 
[34]. 

If $X = G/H$ where $G$ is a group and $H$ is a subgroup and $G \supset N \supset H$, with $N$ a 
subgroup, then the orbits of $N$ in $X$ are a partition and the orbits of $\{gNg^{-1}\}_{g \in G}$ 
form a projection base. One approach to the display of such projections is a 
2-dimensional histogram using the ordering given by one of the metrics suggested in 
Chapter 7 of Diaconis [18]. 

2.3. Projections for $X = \mathbb{R}^p$: Euclidean data. 

Consider data vectors $x_1, x_2, \dots, x_n \in \mathbb{R}^p$. For $\gamma$ in the $p$-dimensional unit sphere, 
the projection in direction $\gamma$ is just $\gamma \cdot x_1, \dots, \gamma \cdot x_n$. This is the classical Radon 
transform, with $\mathcal{Y}$ consisting of the affine hyperplanes $\{x \in \mathbb{R}^p : x \cdot \gamma = t\}$. For 
fixed $\gamma$ these partition the space as $t$ varies, and the partitions vary as $\gamma$ varies. In 
statistical applications, a histogram is made of $\gamma \cdot x_i$ and one varies $\gamma$, trying to 
understand the structure of the $p$-dimensional data from the varying histograms. This 
leads to the classical version of projection pursuit considered in the introduction. 

2.4. Projections when $X$ is a finite set with $n$ elements, and $\mathcal{Y}$ is the 
class of $k$-element subsets. 

In this example, if $k$ divides $n$, it is a non-trivial theorem of Baranyai that $\mathcal{Y}$ 
forms a projection base. Details and discussion may be found in Cameron [7]. This 
example occurs naturally when considering extensions of a given class of partitions. 
For example, consider the marginal projections $y_i^a$ in $Z_2^k$ defined above. These sets 
all have cardinality $|y_i^a| = 2^{k-1}$. It is natural to consider the extension to projections 
based on the class of all subsets of cardinality $2^{k-1}$. 

2.5. Uniqueness of Radon transforms 

We now consider the question: when is $f \mapsto \bar f$ one to one? A convenient criterion 
involves the notion of a block design. Let $|X| = n$. The class of sets $\mathcal{Y}$ is a block 
design with parameters $(n, c, k, l)$ provided 

(4) $|y| = c$ for all $y \in \mathcal{Y}$, 

(5) each $x \in X$ is contained in $k$ subsets $y$, 

(6) each pair $x \ne x'$ is contained in $l$ subsets $y$. 

Affine planes in $Z_2^k$ and $k$-sets of an $n$-set are block designs. A great many other 
examples are discussed in the literature of combinatorial designs. In the statistics 
literature they are sometimes called balanced incomplete block designs. In the 
combinatorial literature they are often called 2-designs, or 2-$(n, c, l)$ designs. It is easy 
to see that the parameters $n, c, k, l$ satisfy 

(7) $|\mathcal{Y}| c = nk$, 

(8) $(n - 1) l = k(c - 1)$. 

Bailey [3], Dembowski [14] and Lander [38] are useful references for block designs. 
The following result is well known in the theory of designs. We first learned it 
from Bolker [4]. 

Theorem 2.2. If $X$ is a finite set and $\mathcal{Y}$ is a block design with $|\mathcal{Y}| > 1$, then the 
Radon transform $f \mapsto \bar f$ is one to one, with an explicit inversion formula given by 
(12) below. 
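
Before the proof, a numerical check may help fix ideas. The sketch below assumes the inversion takes the form $f(x) = \frac{1}{k-l}\big(\sum_{y \ni x} \bar f(y) - \frac{l}{k}\sum_{y} \bar f(y)\big)$, which follows from counting how often each point and each pair of points is covered by the blocks; the function name `invert_radon`, the choice of all 3-subsets of a 6-set as the design, and the test function are illustrative assumptions.

```python
from fractions import Fraction
from itertools import combinations
from math import comb


def invert_radon(fbar, points, k, l):
    """Recover f from its Radon transform over a block design with parameters k, l."""
    total = sum(fbar.values()) / k          # equals the sum of f over X
    return {
        x: (sum(v for y, v in fbar.items() if x in y) - l * total) / (k - l)
        for x in points
    }


# Design: all c-subsets of an n-set (one of the examples named in the text).
n, c = 6, 3
points = list(range(n))
blocks = [frozenset(b) for b in combinations(points, c)]
k = comb(n - 1, c - 1)    # number of blocks through a point
l = comb(n - 2, c - 2)    # number of blocks through a pair of points

f = {x: Fraction(x * x + 1, 7) for x in points}   # arbitrary test function
fbar = {y: sum(f[x] for x in y) for y in blocks}
recovered = invert_radon(fbar, points, k, l)      # should reproduce f exactly
```

Here $k = 10$ and $l = 4$, and the parameter identities (7) and (8) can be confirmed directly from `len(blocks)`, `n`, `c`, `k` and `l`.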

Proof. For any $x$, 

(9) $\sum_{y : x \in y} \bar f(y) = k f(x) + l \sum_{x' \ne x} f(x')$ 

(10) $= (k - l) f(x) + l \sum_{x'} f(x').$ 

If $\sum_{x \in X} f(x) = 1$, this determines $f$ as 

(11) $f(x) = \frac{1}{k - l} \sum_{y : x \in y} \bar f(y) - \frac{l}{k - l}.$ 

Observe that $k > l$ follows from the assumption that $|\mathcal{Y}| > 1$. When $\sum_{x \in X} f(x)$ is 
not known, it can be recovered by summing both sides of (9) in $x$. This gives 

$\sum_{x \in X} f(x) = \frac{1}{k} \sum_{y \in \mathcal{Y}} \bar f(y),$ 

and so the inversion formula 

(12) $f(x) = \frac{1}{k - l} \Big( \sum_{y : x \in y} \bar f(y) - \frac{l}{k} \sum_{y \in \mathcal{Y}} \bar f(y) \Big).$ 

□ 

Remarks. 

• It is not necessary that $\mathcal{Y}$ be a block design for $f \mapsto \bar f$ to be one to one. For 
example, Kung [37] shows that the Radon transform is one to one when $\mathcal{Y}$ 
consists of the sets of rank $i$ in a matroid. Diaconis and Graham [17] give 
examples where the transform is one to one when $\mathcal{Y}$ consists of the nearest 
neighbors in a metric space. For example, when $X = Z_2^k$ and $\mathcal{Y}$ consists of 
the balls of Hamming distance less than or equal to 1, the transform is one 
to one, and an explicit inversion theorem is known. When $X$ is $S_n$, the 
symmetric group, and $\mathcal{Y}$ is unit balls in the Cayley metric, the transform is one 
to one if and only if $n$ is in {1, 2, 4, 5, 6, 8, 10, 12}. Further work on inversion 
formulas for functions on finite symmetric spaces is found in Velasquez [47] 
and for functions on the torus $Z_n^k$ in Dedeo and Velasquez [13]. Fill [24] 
discusses invertibility when the Radon transform of $f$ at $x$ averages over a set of 
translates of $f(x)$, which has applications to directional data and time series. 

• The transform can still be useful and interesting if it is not one to one. For 
example, the marginal projections in the example above do not capture all 
aspects of the data but are often the first things to be looked at. In $Z_2^k$, 
if high enough marginal distributions are considered, the function $f$ can be 
completely recovered. In the symmetric group, the projections corresponding 
to all Young subgroups determine $f$ because they determine its Fourier 
transform. See Diaconis [18] for details. 

3. Data analysis of syllable patterns in the works of Plato 

This section presents a new analysis of data arising from syllable patterns in the 
works of Plato. The data are given in Table 1. It records, for each of 6 books, 
the pattern of long (-) and short (U) syllables among the last 5 syllables in each 
sentence. It is known that Plato wrote Republic early and Laws late. Plato also 
mentions that he changed his rhyming patterns over time. This led Brandwood to 
collect the data in Table 1. 

The other books were written between these but it is not known in what order. 
The goal of the analysis is to try to order the books. Our approach will be to study 
the books one at a time, trying to find patterns. 

Projection pursuit suggests looking at various partitions of the data, searching 
for structured partitions which are far from uniform. Using first and second order 
margins as partitions, a reasonably striking difference between Republic and Laws 
is observed. This suggests a simple, interpretable way of ordering the other books 
as Republic, Timaeus, Sophist, Politicus, Philebus, Laws. 

This agrees with the standard ordering as discussed in Brandwood ([6], pg. xviii) 
and in Ahn et al. [1]. Other analyses of this data set are in Cox and Brandwood [11], 
Atkinson [2], Wishart and Leach [49], Boneva [5], and Charnomordic and Holmes 
[8]. Reference [11] contains a history and explanation for the choice of data. The first three 
analyses all use statistical models. Boneva's analysis uses a form of scaling. None 
of the previous analyses seem to have picked up the simple, striking pattern in the 
data that projection pursuit leads to. 

The analysis is presented below, in a somewhat discursive style, in the order it 
was actually performed: first looking at the Republic, then Laws and finally the 
other books. In the Appendix, we present a more automated and formal version. 

3.1. Republic 

Table 2 shows the first order margins; i.e., the proportion of sentences with U in 
position $i$, $1 \le i \le 5$. 

Roughly, positions 1-4 are evenly divided between long and short. The last po- 
sition is clearly different. Table 3 shows the second order margins. 

A glance at Table 3 shows that the first order effects are all too visible in the 
second order margins. For example, the numbers in the first column (U U) are 
all "small" while the numbers in the last column are "large". One simple way of 
adjusting for the first order structure is to divide each number in Table 3 by the 
product of the marginal totals. For example, in the first row, .194 would be divided 
by (.465)(.471) (from Table 2) while .271 would be divided by (.465)(1 - .471). The 
results are shown in Table 4. 
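
The adjustment just described can be sketched in a few lines. The code assumes the Republic column of Table 1 with the endings listed in the order shown below (short = `u`, long = `-`); the function names `first_order`, `second_order` and `adjusted` are illustrative, and the computation works from the rounded percentages rather than the original sentence counts, so small discrepancies in the last digit are possible.

```python
from itertools import product

# Republic column of Table 1 (percentages), one "ending value" pair per entry.
REPUBLIC = """
uuuuu 1.1  -uuuu 1.6  u-uuu 1.7  uu-uu 1.9  uuu-u 2.1  uuuu- 2.0
--uuu 2.1  -u-uu 2.1  -uu-u 2.8  -uuu- 4.6  u--uu 3.3  u-u-u 2.6
u-uu- 4.6  uu--u 2.6  uu-u- 4.4  uuu-- 2.5  ---uu 2.9  --u-u 3.0
--uu- 3.4  -u--u 2.0  -u-u- 6.4  -uu-- 4.2  uu--- 2.8  u-u-- 4.2
u--u- 4.8  u---u 2.4  u---- 3.5  -u--- 4.0  --u-- 4.1  ---u- 4.1
----u 2.0  ----- 4.2
"""
tokens = REPUBLIC.split()
f = {tokens[t]: float(tokens[t + 1]) / 100 for t in range(0, len(tokens), 2)}


def first_order(f, i):
    """Proportion of endings with a short syllable (u) in position i (1-based)."""
    return sum(p for pat, p in f.items() if pat[i - 1] == "u")


def second_order(f, i, j):
    """Proportions of the patterns uu, u-, -u, -- in positions i < j."""
    return {
        a + b: sum(p for pat, p in f.items() if pat[i - 1] == a and pat[j - 1] == b)
        for a, b in product("u-", repeat=2)
    }


def adjusted(f, i, j):
    """Second order margins divided by the product of the matching first order margins."""
    pi, pj = first_order(f, i), first_order(f, j)
    marg_i = {"u": pi, "-": 1 - pi}
    marg_j = {"u": pj, "-": 1 - pj}
    return {
        ab: p / (marg_i[ab[0]] * marg_j[ab[1]])
        for ab, p in second_order(f, i, j).items()
    }


row_12 = {ab: round(v, 2) for ab, v in adjusted(f, 1, 2).items()}
```

For positions (1,2) this reproduces the first row of Table 4 after rounding to two decimals.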

Most of the ratios are close to 1, so a product model is a reasonable first descrip- 
tion. The projection pursuit approach suggests that a partition of the data (here a 
row) is "interesting" if the partition is far from uniform. By eye, looking at Table 
4, positions (1, 2), (2, 3), (3, 4), (4, 5) are far from being all 1. Observe that these 
positions are adjacent, as (i, i + 1). 

Table 2

First order margins for Republic

Position          1       2       3       4       5
Proportion of U   0.465   0.471   0.466   0.511   0.362

Table 3

Second order margins for Republic

Position   U U     U -     - U     - -
(1,2)      0.194   0.271   0.277   0.258
(1,3)      0.208   0.257   0.258   0.277
(1,4)      0.238   0.227   0.272   0.263
(1,5)      0.177   0.288   0.185   0.350
(2,3)      0.209   0.262   0.257   0.272
(2,4)      0.241   0.230   0.269   0.260
(2,5)      0.162   0.309   0.200   0.329
(3,4)      0.211   0.255   0.299   0.235
(3,5)      0.170   0.296   0.192   0.342
(4,5)      0.167   0.343   0.195   0.295

Next observe that each of the 4 designated rows has a common pattern: the 
first and last entries are small, the middle two entries are large. Going back to the 
definitions, this pattern arises from a negative association of adjacent syllables; in 
the Republic, adjacent syllables tend to alternate. The pattern in positions (1,3) 
shows that this cannot be a complete description; after all, if the symbols alternate, 
the positions two apart should be positively associated, but (1,3) displays negative 
association. Looking at the other rows of the table, we observe that the size goes 
big, small, small, big or its opposite, small, big, big, small. This is an artifact. 
Consider the first row of Table 4. It was formed from 4 proportions that sum to 1: 
$w, x, y, z$ say. The 4 adjusted entries are 

$\frac{w}{(w + x)(w + y)}, \quad \frac{x}{(w + x)(x + z)}, \quad \frac{y}{(y + z)(y + w)}, \quad \frac{z}{(z + y)(z + x)}.$ 

It is easy to show that the first entry is less than 1 if and only if the second is 
larger than 1, if and only if the third is larger than one, if and only if the fourth 
is less than 1. This means that the first column in Table 4, together with the first 
order margins, determines the remaining entries. This artifact in no way reflects 
on the association pattern noted earlier: the most structured rows correspond to 
adjacent syllables, and adjacent syllables are negatively associated. 
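
The equivalence claimed for the four adjusted entries can be spelled out in one line each. The derivation below is a sketch that uses only $w + x + y + z = 1$ and assumes all four marginal products are positive; each condition reduces to the same comparison of $wz$ with $xy$.

```latex
\begin{align*}
\frac{w}{(w+x)(w+y)} < 1
  &\iff w\,(w+x+y+z) < (w+x)(w+y) \iff wz < xy, \\
\frac{x}{(w+x)(x+z)} > 1
  &\iff x\,(w+x+y+z) > (w+x)(x+z) \iff xy > wz, \\
\frac{y}{(y+z)(y+w)} > 1
  &\iff y\,(w+x+y+z) > (y+z)(y+w) \iff xy > wz, \\
\frac{z}{(z+y)(z+x)} < 1
  &\iff z\,(w+x+y+z) < (z+y)(z+x) \iff wz < xy.
\end{align*}
```

So the sign pattern within a row is fixed by whether $wz < xy$, that is, by whether the two positions are negatively associated.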

3.2. Laws and a comparison with Republic. 

The first order margins for Laws are only slightly different from those in Republic 
(see Table 5). 

The pattern is the same: overall, fewer than half U's; the last position sharply 
smaller. The similarity between the first order margins in Republic and Laws sug- 
gests that second or higher order margins must be used to order the remaining 
books. The analog of the first column of Table 4 is given in Table 6. 

The entries of Table 6 are the proportion of sentences with U U in the given positions 
divided by the product of the marginal proportions. 



Table 4

Adjusted second order margins for Republic

Position   U U    U -    - U    - -
(1,2)      0.89   1.10   1.10   0.91
(1,3)      0.96   1.00   1.00   0.97
(1,4)      1.00   1.00   1.00   1.00
(1,5)      1.10   0.97   0.96   1.00
(2,3)      0.95   1.00   1.00   0.96
(2,4)      1.00   1.00   1.00   1.00
(2,5)      0.95   1.00   1.00   0.97
(3,4)      0.89   1.10   1.10   0.90
(3,5)      1.00   1.00   0.99   1.00
(4,5)      0.90   1.10   1.10   0.94

Table 5

First order margins for Laws

Position          1       2       3       4       5
Proportion of U   0.477   0.489   0.411   0.599   0.375

Table 6

Adjusted second order margins for Laws

Positions      (1,2)   (1,3)   (1,4)   (1,5)   (2,3)
Adjusted U U   1.07    1.03    0.92    0.99    1.43
Positions      (2,4)   (2,5)   (3,4)   (3,5)   (4,5)
Adjusted U U   0.97    0.98    1.04    1.09    1.02

Again, pairwise adjacent positions are associated, all in the same way. Here, the 
association is positive, whereas for Republic, the association is negative. This is the 
striking pattern referred to above. It suggests a method of ranking the other books: 
compare the sign pattern or actual ratios of the adjusted second order margins of 
other books with Republic and Laws. 

For definiteness, the sum of absolute deviations between second order margins 
over all 10 positions will be used. This is carried out data analytically in Sections 
3.3-3.5. 

3.3. Analysis for Philebus and Politicus 

These books are somewhat similar to each other. The first and second order margins 
for Philebus are given in Tables 7 and 8. 

Note the difference in first order margins: between Philebus and Republic (or 
Laws) position 1 is high, as are positions 4 and 5. For second order margins, the 
adjacent patterns are all positively associated ((2,3) being truly extreme). Compar- 
ing Table 8 with Table 6, the association pattern matches Laws in direction, except 
in position (1,5). The relevant averages for Politicus are given in Tables 9 and 10. 

The first order margins are, very roughly, like those in both Republic and Laws, 
but again the third position has a low proportion of short syllables. The second 
order margins have the same pattern as Laws. The same remarks made for the 
second order margins of Philebus apply. 

Both Philebus and Politicus seem very similar to Laws. Which of these two is 
closest to Laws? One simple approach is to consider the sum of the absolute values 
of the differences between the entries of Tables 8 and 6, along with the differences 
between Tables 10 and 6. The sum for Laws to Philebus is .64, while the sum for Laws to 
Politicus is .83. Thus a tentative ranking is: Politicus, Philebus, Laws. 

Table 7

First order margins for Philebus

Position          1       2       3       4       5
Proportion of U   0.522   0.464   0.398   0.594   0.465

Table 8

Adjusted second order margins for Philebus

Positions      (1,2)   (1,3)   (1,4)   (1,5)   (2,3)
Adjusted U U   1.11    1.03    0.85    1.11    1.48
Positions      (2,4)   (2,5)   (3,4)   (3,5)   (4,5)
Adjusted U U   0.92    0.85    1.02    0.95    1.01

Table 9

First order margins for Politicus

Position          1       2       3       4       5
Proportion of U   0.477   0.457   0.348   0.524   0.469

Table 10

Adjusted second order margins for Politicus

Positions      (1,2)   (1,3)   (1,4)   (1,5)   (2,3)
Adjusted U U   1.17    1.10    0.96    1.01    1.26
Positions      (2,4)   (2,5)   (3,4)   (3,5)   (4,5)
Adjusted U U   0.86    0.90    1.05    1.10    1.13

3.4. Analysis for Sophist and Timaeus 

These books are quite similar to each other and, as we shall see, quite different 
from Laws, Philebus and Politicus. 

The first order margins are quite different from the books examined previously. 
They are roughly consistent with all syllables being equally likely to be long or short. 
The first order pattern seems closest to Politicus. The second order associations are 
closer to 1 than in Laws, Politicus or Philebus. Adjacent positions are positively 
associated, except for (3,4). The direction of association matches Laws in only 6 of 
10 positions. The sum of absolute deviations between the entries of Tables 6 and 
12 is .87. 

We now give the analysis for the final book. 

A distinctive feature of the first order margins is the large proportion of short 
syllables in the third position. The adjusted second order margins are close to 1, so 
Timaeus seems closest to Sophist. Of the 4 adjacent positions, two show positive 
association and two show negative association. The direction of association matches 
Laws in 6 positions; the sum of absolute deviations between Tables 14 and 6 is .94. 
The distance between Timaeus and the Republic (Tables 14 and 4) is .6, so Timaeus 
seems closer to Republic than to Laws using this measure. Because of the decrease in 
the number of matches and the increase in the sum of absolute deviations, it seems 
reasonable to rank order the three as Republic, Timaeus, Sophist. This completes 
the discussion of this example. The Appendix contains an automated version. 
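
The distance comparisons used throughout Section 3 can be sketched as follows. The lists hold the adjusted U U entries as printed in Tables 4 (first column), 6, 8, 10, 12 and 14, in the order (1,2), (1,3), (1,4), (1,5), (2,3), (2,4), (2,5), (3,4), (3,5), (4,5); the names `ADJ_UU` and `l1` are illustrative. Because the inputs are two-decimal table entries, the sums need not reproduce the figures quoted in the text exactly, which were presumably computed from unrounded values.

```python
# Adjusted U U margins, as printed in the tables of this section.
ADJ_UU = {
    "Republic":  [0.89, 0.96, 1.00, 1.10, 0.95, 1.00, 0.95, 0.89, 1.00, 0.90],
    "Laws":      [1.07, 1.03, 0.92, 0.99, 1.43, 0.97, 0.98, 1.04, 1.09, 1.02],
    "Philebus":  [1.11, 1.03, 0.85, 1.11, 1.48, 0.92, 0.85, 1.02, 0.95, 1.01],
    "Politicus": [1.17, 1.10, 0.96, 1.01, 1.26, 0.86, 0.90, 1.05, 1.10, 1.13],
    "Sophist":   [1.07, 1.03, 1.01, 0.93, 1.07, 0.88, 1.01, 0.97, 0.98, 1.10],
    "Timaeus":   [0.98, 1.02, 0.97, 1.04, 0.92, 0.94, 0.97, 0.96, 0.97, 1.06],
}


def l1(a, b):
    """Sum of absolute deviations between two rows of adjusted margins."""
    return sum(abs(x - y) for x, y in zip(a, b))


middle_books = ["Philebus", "Politicus", "Sophist", "Timaeus"]
to_laws = {b: l1(ADJ_UU[b], ADJ_UU["Laws"]) for b in middle_books}
closest_to_laws_first = sorted(middle_books, key=to_laws.get)
timaeus_to_republic = l1(ADJ_UU["Timaeus"], ADJ_UU["Republic"])
```

Sorting by distance to Laws places Philebus nearest and Timaeus farthest, which is consistent with the ordering Republic, Timaeus, Sophist, Politicus, Philebus, Laws proposed in the text.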

4. Most projections are uniform 

Graphical projection pursuit is a standard tool in data analysis. The classical survey 
of Huber [32], the survey of Posse [42] and the online documentation in the Xgobi 
and Ggobi packages contain extensive pointers to a large literature. 



Table 11

First order margins for Sophist

Position          1       2       3       4       5
Proportion of U   0.474   0.491   0.454   0.527   0.487

Table 12

Adjusted second order margins for Sophist

Positions      (1,2)   (1,3)   (1,4)   (1,5)   (2,3)
Adjusted U U   1.07    1.03    1.01    0.93    1.07
Positions      (2,4)   (2,5)   (3,4)   (3,5)   (4,5)
Adjusted U U   0.88    1.01    0.97    0.98    1.10

Table 13

First order margins for Timaeus

Position          1       2       3       4       5
Proportion of U   0.494   0.476   0.565   0.521   0.496

Table 14

Adjusted second order margins for Timaeus

Positions      (1,2)   (1,3)   (1,4)   (1,5)   (2,3)
Adjusted U U   0.98    1.02    0.97    1.04    0.92
Positions      (2,4)   (2,5)   (3,4)   (3,5)   (4,5)
Adjusted U U   0.94    0.97    0.96    0.97    1.06

The theorems of this section imply that for most data sets $f(x)$, most projections 
$\bar f(y)$ are about the same: close to uniform. This necessitates projection pursuit, 
that is, choosing projections that are far from uniformly distributed, to determine what 
is special about a particular $f$. This gives an independent rationale for Huber's 
suggestion that Euclidean projections are interesting if they are far from uniform 
in the sense of having minimum entropy (of course, the uniform distribution on a 
finite set has maximum entropy). 

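Theorem 4.1. Let $X$ be a finite set with $n$ elements. Let $\mathcal{Y}$ be a block design 
with block size $c$ (so $|y| = c$ for $y \in \mathcal{Y}$). Let $f : X \to \mathbb{R}$ be any function and let 
$\mu(f) = \sum_{x \in X} f(x)$. Let $y$ be chosen uniformly in $\mathcal{Y}$. Then 

(13) $E(\bar f(y)) = \frac{c}{n} \mu(f),$ 

(14) $\mathrm{var}(\bar f(y)) = \frac{c}{n} \left(1 - \frac{c - 1}{n - 1}\right) \mu\left(\left(f - \frac{\mu(f)}{n}\right)^2\right).$ 

Proof. (13) follows from computing 

$E(\bar f(y)) = \frac{1}{|\mathcal{Y}|} \sum_{y} \sum_{x \in y} f(x) = \frac{k}{|\mathcal{Y}|} \mu(f) = \frac{c}{n} \mu(f).$ 

For (14), assume without loss of generality that $\mu(f) = 0$. Then 

$\mathrm{var}(\bar f(y)) = \frac{1}{|\mathcal{Y}|} \sum_{y} \bar f(y)^2 = \frac{1}{|\mathcal{Y}|} \sum_{y} \sum_{x \in y} f(x) \Big( f(x) + \sum_{x' \in y,\, x' \ne x} f(x') \Big) = \frac{k - l}{|\mathcal{Y}|} \sum_{x} f(x)^2.$ 

From (7) and (8), $\frac{l}{|\mathcal{Y}|} = \frac{c(c - 1)}{n(n - 1)}$, giving the result. □ 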

Example 4.2. When $\mathcal{Y}$ is the $j$-sets of an $n$-set, $|\mathcal{Y}| = \binom{n}{j}$, $c = j$, and the result 
reduces to the usual mean and variance for a sample without replacement. 
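
The mean and variance formulas of Theorem 4.1 can be verified by brute force in the setting of Example 4.2. The sketch below enumerates all $c$-subsets of an $n$-set and compares the exact moments of $\bar f(y)$ with (13) and (14); the function names and the test function $f$ are illustrative assumptions.

```python
from fractions import Fraction
from itertools import combinations


def radon_moments(f, blocks):
    """Exact mean and variance of the Radon transform over uniformly chosen blocks."""
    vals = [sum(f[x] for x in y) for y in blocks]
    mean = sum(vals) / len(vals)
    var = sum((v - mean) ** 2 for v in vals) / len(vals)
    return mean, var


def predicted_moments(f, c):
    """Right-hand sides of (13) and (14) for a block design with block size c."""
    n = len(f)
    mu = sum(f.values())
    spread = sum((fx - mu / n) ** 2 for fx in f.values())
    mean = Fraction(c, n) * mu
    var = Fraction(c, n) * (1 - Fraction(c - 1, n - 1)) * spread
    return mean, var


n, c = 7, 3
f = {x: Fraction(x * x, 3) for x in range(n)}          # arbitrary test function
blocks = list(combinations(range(n), c))               # all c-subsets of the n-set
observed = radon_moments(f, blocks)
expected = predicted_moments(f, c)                     # agrees with observed exactly
```

Rational arithmetic keeps the comparison exact; with floating point inputs a tolerance would be needed instead.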

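Example 4.3. Let $X = \mathbb{Z}_2^k$ and $\mathcal{Y}$ be the $j$-dimensional affine planes. Then $n = 2^k$
and $c = 2^{k-j}$. If $\mu(f) = 1$, the result becomes

\[ E(\bar f(y)) = \frac{1}{2^j}, \qquad \operatorname{var}(\bar f(y)) = \frac{1}{2^j}\Bigl(1 - \frac{2^{k-j}-1}{2^k-1}\Bigr)\,\mu\Bigl(\bigl(f - \tfrac{1}{2^k}\bigr)^2\Bigr). \]

For future use, observe that the cardinality of $\mathcal{Y}$ in this case is

\[ \frac{2^j(2^k-1)(2^k-2)\cdots(2^k-2^{j-1})}{(2^j-1)\cdots(2^j-2^{j-1})}. \]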
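
The counting formula and the moment formulas can be checked on a small case. The sketch below assumes blocks of size $c = 2^{k-j}$, enumerates the affine hyperplanes ($j = 1$) of $\mathbb{Z}_2^3$ directly, and compares the exact variance of the projections with (14). The helper names and the weights defining `f` are hypothetical.

```python
from fractions import Fraction
from itertools import product


def count_affine_planes(k, j):
    """Number of affine planes of size 2**(k-j) in Z_2^k (formula above)."""
    num, den = 2 ** j, 1
    for i in range(j):
        num *= 2 ** k - 2 ** i
        den *= 2 ** j - 2 ** i
    return num // den


def affine_hyperplanes(k):
    """All sets {x : x.z = a (mod 2)} with z nonzero and a in {0, 1}."""
    points = list(product((0, 1), repeat=k))
    planes = []
    for z in points:
        if any(z):
            for a in (0, 1):
                planes.append([x for x in points
                               if sum(xi * zi for xi, zi in zip(x, z)) % 2 == a])
    return points, planes


k = 3
points, planes = affine_hyperplanes(k)
weights = [1, 2, 3, 4, 5, 6, 7, 8]  # hypothetical unnormalised weights
f = {x: Fraction(w, sum(weights)) for x, w in zip(points, weights)}  # mu(f) = 1

n, c = 2 ** k, 2 ** (k - 1)
proj = [sum(f[x] for x in y) for y in planes]
mean = sum(proj) / len(proj)
var = sum((p - mean) ** 2 for p in proj) / len(proj)
spread = sum((v - Fraction(1, n)) ** 2 for v in f.values())
predicted_var = Fraction(c, n) * (1 - Fraction(c - 1, n - 1)) * spread
print(len(planes), mean, var == predicted_var)
```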

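Returning to the situation in Theorem 4.1, Chebychev's inequality implies:

Corollary 4.4. With notation as in Theorem 4.1, the proportion of $y \in \mathcal{Y}$ such
that

\[ \Bigl|\bar f(y) - \frac{c}{n}\,\mu(f)\Bigr| > \epsilon \]

is smaller than

\[ \frac{1}{\epsilon^2}\,\frac{c}{n}\Bigl(1 - \frac{c-1}{n-1}\Bigr)\,\mu\Bigl(\bigl(f - \tfrac{\mu(f)}{n}\bigr)^2\Bigr). \]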
Remarks. The corollary implies that for functions $f$ which are "not too wild" in
the sense that $\mu((f - \mu(f)/n)^2)$ is small, most transforms $\bar f(y)$ are uninformative in
the sense of being close to their mean value. As an example, take $X = \mathbb{Z}_2^5$ and $f$
the function defined by the first column of Table 1. Then $\mu((f - 1/n)^2) = .0021$. If
$\mathcal{Y}$ is taken as the set of all affine hyperplanes, the corollary gives that 95% of the
transforms have $|\bar f(y) - \frac12| < .04$.

The next theorem says that for most probabilities $f$, $\mu((f - 1/n)^2)$ is small (about $1/n$).

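Theorem 4.5. Let $(U_1, U_2, \ldots, U_n)$ be chosen uniformly on the $n$ simplex. For
large $n$, the random variable

\[ \frac{n^{3/2}}{2}\Bigl(\sum_{i=1}^{n}\bigl(U_i - \tfrac{1}{n}\bigr)^2 - \frac{1}{n}\Bigr) \]

has an approximate standard normal distribution.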

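Proof. The argument uses the representation of a uniform distribution by means
of exponential variables. Let $X_1, X_2, \ldots, X_n$ be independent standard exponential
variables with density $e^{-x}$ on $[0, \infty)$. Let

\[ S_1 = \sum_{i=1}^{n} X_i, \qquad S_2 = \sum_{i=1}^{n} X_i^2. \]

For large $n$, the random vector

\[ (Z_1, Z_2) = \Bigl(\frac{S_1 - n}{\sqrt{n}},\ \frac{S_2 - 2n}{\sqrt{n}}\Bigr) \]

has an approximate bivariate normal distribution with mean vector zero and
covariance matrix $\begin{pmatrix} 1 & 4 \\ 4 & 20 \end{pmatrix}$. To check the covariance matrix, note that
$\operatorname{var}\bigl(\frac{S_1 - n}{\sqrt{n}}\bigr) = \operatorname{var}(X_1) = 1$, $\operatorname{var}\bigl(\frac{S_2 - 2n}{\sqrt{n}}\bigr) = \operatorname{var}(X_1^2) = 20$ and

\[ \frac{1}{n}E\bigl((S_1 - n)(S_2 - 2n)\bigr) = E\bigl((X_1 - 1)(X_1^2 - 2)\bigr) = E(X_1^3) - E(X_1^2) - 2E(X_1) + 2 = 4. \]

Represent a uniform vector on the $n$ simplex as $U_i = X_i/S_1$. Then

\[ \sum_{i=1}^{n}\bigl(U_i - \tfrac{1}{n}\bigr)^2 = \frac{1}{S_1^2}\sum_{i=1}^{n} X_i^2 - \frac{1}{n} = \frac{1}{S_1^2}\sum_{i=1}^{n}(X_i^2 - 2) + \frac{2n}{S_1^2} - \frac{1}{n}. \]

Now $S_1 = n\bigl(1 + \frac{Z_1}{\sqrt{n}}\bigr)$ with $Z_1 = \frac{S_1 - n}{\sqrt{n}}$. Thus

\[ S_1^2 = n^2\Bigl(1 + \frac{2Z_1}{\sqrt{n}} + \frac{Z_1^2}{n}\Bigr). \]

Using the standard $O_p$ notation (see Pratt [4.S]),

\[ \frac{1}{S_1^2} = \frac{1}{n^2}\Bigl(1 - \frac{2Z_1}{\sqrt{n}} + O_p\bigl(\tfrac{1}{n}\bigr)\Bigr). \]

Thus,

\[ \frac{1}{S_1^2}\sum_{i=1}^{n}(X_i^2 - 2) = \frac{Z_2}{n^{3/2}} + O_p\bigl(\tfrac{1}{n^2}\bigr), \qquad \frac{2n}{S_1^2} = \frac{2}{n} - \frac{4Z_1}{n^{3/2}} + O_p\bigl(\tfrac{1}{n^2}\bigr), \]

so that $\sum_i (U_i - 1/n)^2 = \frac{1}{n} + \frac{Z_2 - 4Z_1}{n^{3/2}} + O_p(n^{-2})$.
The bivariate limiting normality of $(Z_1, Z_2)$ implies that $Z_2 - 4Z_1$ has an approximate
normal distribution with mean 0 and variance

\[ \operatorname{var}(Z_2) + 16\operatorname{var}(Z_1) - 8\operatorname{covar}(Z_1, Z_2) = 4. \qquad \square \]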
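
The constants in this proof can be checked mechanically from the moments $E(X^k) = k!$ of a standard exponential variable. The sketch below also includes a small seeded simulation, offered only as an illustration, showing that $n\sum_i (U_i - 1/n)^2$ is close to 1 for a vector drawn uniformly from the simplex. All function names are illustrative assumptions.

```python
import random
from math import factorial


def exp_moment(k):
    """k-th moment of a standard exponential variable: E[X^k] = k!."""
    return factorial(k)


var_x = exp_moment(2) - exp_moment(1) ** 2           # var(X)
var_x2 = exp_moment(4) - exp_moment(2) ** 2          # var(X^2)
cov = exp_moment(3) - exp_moment(1) * exp_moment(2)  # cov(X, X^2)
limit_var = var_x2 + 16 * var_x - 8 * cov            # var(Z2 - 4 Z1)


def simplex_spread(n, rng):
    """Sum of (U_i - 1/n)^2 for U uniform on the n simplex (U_i = X_i / S)."""
    xs = [rng.expovariate(1.0) for _ in range(n)]
    s = sum(xs)
    return sum((x / s - 1.0 / n) ** 2 for x in xs)


n = 2000
scaled = n * simplex_spread(n, random.Random(0))
print(var_x, var_x2, cov, limit_var, round(scaled, 3))
```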

Corollary 4.4 and Theorem 4.5 imply that for most probabilities $f$, most transforms
$\bar f(y)$ are close to uniform.

The final result of this section deals with the entire projection $\{\bar f(y)\}_{y \in p}$ where $p$
is a partition of $X$ into blocks in $\mathcal{Y}$. Let $X$ be a finite set. Let $\mathcal{Y}$ be a block design
on $X$ with parameters $(n, c, k, l)$. Suppose that $\mathcal{Y}$ is also a projection base for $X$
with $p_1, p_2, \ldots, p_j$ being a partition of $\mathcal{Y}$, and each $p_i$ being a partition of $X$. Of
course, $j = c|\mathcal{Y}|/n$. The next theorem implies that for most functions, the projection
onto a randomly chosen partition is uniformly close to $\frac{c}{n}$.

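Theorem 4.6. Let $\mathcal{Y}$ be a block design on $X$ with parameters $(n, c, k, l)$. Suppose
that $\mathcal{Y}$ is a projection base. Let $f$ be a fixed probability on $X$. Let the partition $p$ be
chosen uniformly at random over all partitions $p_i$ of $X$, where $p_i \subset \mathcal{Y}$. For $\epsilon > 0$,

\[ \sum_{y \in p}\Bigl|\bar f(y) - \frac{c}{n}\Bigr| < \epsilon \tag{15} \]

with probability at least

\[ 1 - \frac{1}{\epsilon}\Bigl\{\frac{n(n-c)}{c(n-1)}\,\mu\Bigl(\bigl(f - \tfrac{1}{n}\bigr)^2\Bigr)\Bigr\}^{1/2}. \]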



Proof. The probability model for choosing a random partition is based on a fixed
enumeration $p_1, p_2, \ldots, p_j$ of the partitions that make up $\mathcal{Y}$. Each partition is
assumed to be taken in a fixed order $p_i = \{y_i^{(1)}, \ldots, y_i^{(n/c)}\}$. The random variable
$S(p) = \sum_{y \in p}|\bar f(y) - \frac{c}{n}|$ is invariant under permuting the $y \in p$ among
themselves. Thus a random variable with the same distribution as $S(p)$ but exchangeable
$\{\bar f(y)\}_{y \in p}$ exists. For this realization, $E\bigl(\sum_{y \in p}|\bar f(y) - \frac{c}{n}|\bigr) = \frac{n}{c}E|\bar f(y^*) - \frac{c}{n}|$ with $y^*$
chosen uniformly in $\mathcal{Y}$. Using Cauchy-Schwarz and Theorem 4.1, the expectation
is bounded above by

\[ \Bigl\{\frac{n(n-c)}{c(n-1)}\,\mu\Bigl(\bigl(f - \tfrac{1}{n}\bigr)^2\Bigr)\Bigr\}^{1/2}. \]

Theorem 4.6 follows from this bound and Markov's inequality applied to the
original random variable. □

Remarks. From Theorem 4.5, $\mu((f - \frac{1}{n})^2) \doteq \frac{1}{n}$ for most functions $f$. For such $f$, the
theorem implies that for large block size $c$, most partitions are close to uniform in
variation distance. This may be contrasted with Theorems 4.1 and 4.5 which imply
that the components $\bar f(y)$ of most projections are close to $\frac{c}{n}$. When $c$ is small, there
are many terms in the sum (15). As an example, consider the 2-sets of an $n$ set
where $n = 2j$. Let $p$ be a random partition into 2-element sets. Let $f$ be chosen at
random from the $n$ simplex and $p$ any fixed partition into two element sets. It is
straightforward to show that with probability tending to 1 as $n$ tends to infinity,

\[ \sum_{y \in p}\Bigl|\bar f(y) - \frac{2}{n}\Bigr| \doteq 4e^{-2} \approx 0.54. \]

The analogous result holds with the same assumptions when $p$ is any fixed
partition of fixed size $c$. Similarly, it is natural to ask for a central limit theorem in
connection with Theorems 4.1 and 4.5. For $j$ sets of an $n$ set, such a theorem is
available from the usual results on sampling without replacement from a finite
population. Most likely, there is a similar set of results for block designs with $|\mathcal{Y}|$ and $c$
large. See Stein [45] for results for designs arising from subgroups of a finite group.



5. Least uniform partitions 



The results of Section 4 imply that, under suitable conditions, for most functions
the projection along most partitions is close to uniform. This suggests that the
special properties of particular functions are only seen in partitions that are far
from uniform. In this section, properties of least uniform partitions are examined.
Theorem 5.1 shows that for most functions, even the least uniform partitions will
be close to uniform if the number of sets in $\mathcal{Y}$ is small in the sense that $\log|\mathcal{Y}|$
is small compared both to $n$ and the block size $c$. This is true, in particular, for
affine hyperplanes in $\mathbb{Z}_2^k$.

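Theorem 5.1. Let $X$ be a set of $n$ elements. Let $\mathcal{Y}$ be a class of subsets in $X$
of fixed cardinality $c$. Suppose that $p_1, \ldots, p_j$ is a partition of $\mathcal{Y}$ into partitions of
$X$. Let $f$ be chosen at random in the $n$ simplex. Let $p^*$ be the partition among the $p_i$ that
maximizes $\sum_{y \in p}|\bar f(y) - \frac{c}{n}|$. For any $\epsilon > 0$,

\[ \sum_{y \in p^*}\Bigl|\bar f(y) - \frac{c}{n}\Bigr| < \epsilon \]

except for a set of $f$'s of probability smaller than

\[ (|\mathcal{Y}| + 1)\,\beta \]

with $\beta$ equal to 1 minus

\[ \frac{1}{\beta(c, n)}\int_{\frac{c}{n}(1-\epsilon)}^{\frac{c}{n}(1+\epsilon)} x^{c-1}(1 - x)^{n-c-1}\,dx \tag{16} \]

where $\beta(c, n)$ denotes the beta function.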

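Proof. Represent the components of a randomly chosen $f$ as $\frac{X_i}{S}$ where $X_i$ are
independent standard exponentials and $S = \sum_{i=1}^{n} X_i$. Let $y^*$ be the set in $\mathcal{Y}$ with
the largest value of $\bar f(y)$. The argument begins by bounding the probability
that

\[ \Bigl|\bar f(y^*) - \frac{c}{n}\Bigr| < \frac{c}{n}\,\epsilon. \]

To begin with,

\[ P\Bigl(\bar f(y^*) < \frac{c}{n}(1 - \epsilon)\Bigr) \le P\Bigl(\frac{X_1 + \cdots + X_c}{S} < \frac{c}{n}(1 - \epsilon)\Bigr). \]

Further,

\[ P\Bigl(\bar f(y^*) > \frac{c}{n}(1 + \epsilon)\Bigr) \le \sum_{y \in \mathcal{Y}} P\Bigl(\bar f(y) > \frac{c}{n}(1 + \epsilon)\Bigr) = |\mathcal{Y}|\,P\Bigl(\frac{X_1 + \cdots + X_c}{S} > \frac{c}{n}(1 + \epsilon)\Bigr). \]

Next, let $y_*$ denote the set in $\mathcal{Y}$ with the smallest value of $\bar f(y)$. To bound the
probability that $|\bar f(y_*) - \frac{c}{n}| < \frac{c}{n}\epsilon$, observe that $\bar f(y_*) = 1 - \bar f(y^{**})$ with $y^{**}$ the
union of sets in a partition omitting the one block that minimizes $\bar f$. Thus,

\[ P\Bigl(\bar f(y_*) < \frac{c}{n}(1 - \epsilon)\Bigr) = P\Bigl(\bar f(y^{**}) > 1 - \frac{c}{n}(1 - \epsilon)\Bigr) \le |\mathcal{Y}|\,P\Bigl(\frac{X_1 + \cdots + X_c}{S} < \frac{c}{n}(1 - \epsilon)\Bigr). \]

Further,

\[ P\Bigl(\bar f(y_*) > \frac{c}{n}(1 + \epsilon)\Bigr) = P\Bigl(\bar f(y^{**}) < 1 - \frac{c}{n}(1 + \epsilon)\Bigr) \le P\Bigl(\frac{X_1 + \cdots + X_{n-c}}{S} < 1 - \frac{c}{n}(1 + \epsilon)\Bigr). \]

Summing the four bounds thus obtained we see that both

\[ \Bigl|\bar f(y^*) - \frac{c}{n}\Bigr| < \epsilon\,\frac{c}{n}, \qquad \Bigl|\bar f(y_*) - \frac{c}{n}\Bigr| < \epsilon\,\frac{c}{n} \tag{17} \]

except for a set of $f$'s of probability smaller than $(|\mathcal{Y}| + 1)\beta$ as defined by (16). Now
(17) implies that $|\bar f(y) - \frac{c}{n}| < \epsilon\frac{c}{n}$ for all $y \in \mathcal{Y}$. Summing this last inequality over
the $n/c$ blocks of the partition $p^*$ completes the proof of the theorem. □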

Remarks. The beta integral that appears in the bound is straightforward to
approximate numerically. A raft of techniques and approximations appear in the first
chapter of Pearson [40]. For example, consider cases where $\frac{c}{n} = \frac{1}{2}$. Then, using the
Peizer-Pratt approximation given in Pearson [40], and Mills' ratio, the $\beta$ in (16) is
approximately



, with X = . 2c log — . 

^l + x y ^4(i-e)(i+e) 

For this to be small when multiplied by $|\mathcal{Y}| + 1$, it clearly suffices that $\log|\mathcal{Y}|$ be
small compared to $c$. This is the case for the affine subspaces of dimension $j$ in
$\mathbb{Z}_2^k$ if $j$ is bounded and $k$ is large.

As a numerical example, consider the affine hyperplanes in $\mathbb{Z}_2^{10}$. Then $|\mathcal{Y}| + 1 =
2049$, $c = 512$, $n = 1024$. Taking $\epsilon = 0.1$, $(|\mathcal{Y}| + 1)\beta$ = 2.595 x lO"^.
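
The bound of Theorem 5.1 can also be evaluated directly. The sketch below rests on a stated assumption: for $f$ uniform on the $n$ simplex, the sum of $c$ fixed coordinates follows a Beta$(c, n-c)$ law (a standard property of Dirichlet vectors), so the integrand in (16) is normalised accordingly. For integer parameters the regularised incomplete beta function equals a binomial tail, which is summed in log space to avoid overflow. The function names are hypothetical, and no claim is made about reproducing the figure quoted above.

```python
from math import exp, lgamma, log


def reg_inc_beta(x, a, b):
    """Regularised incomplete beta I_x(a, b) for positive integers a, b,
    using the identity I_x(a, b) = P(Binomial(a + b - 1, x) >= a)."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    m = a + b - 1
    total = 0.0
    for k in range(a, m + 1):
        log_term = (lgamma(m + 1) - lgamma(k + 1) - lgamma(m - k + 1)
                    + k * log(x) + (m - k) * log(1.0 - x))
        total += exp(log_term)  # underflows harmlessly to 0.0 for tiny terms
    return total


def beta_bound(num_sets, c, n, eps):
    """(|Y| + 1) * beta, where beta is the mass a Beta(c, n - c) law puts
    outside the interval [(c/n)(1 - eps), (c/n)(1 + eps)]."""
    lo, hi = (c / n) * (1 - eps), (c / n) * (1 + eps)
    inside = reg_inc_beta(hi, c, n - c) - reg_inc_beta(lo, c, n - c)
    return (num_sets + 1) * (1.0 - inside)


# Parameters of the numerical example in the text.
print(beta_bound(2048, 512, 1024, 0.1))
```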

The next theorem shows that when there are many sets in $\mathcal{Y}$, the least uniform
projection is typically far from uniform. The theorem deals with $n$ sets in a set
of cardinality $2n$. The variation distance of a typical probability projected along
the least uniform half split is shown to be about 0.3. This may be compared with
Theorems 4.5 and 4.6 which show that for a typical probability $f$ on $2n$ points,
$|\bar f(y) - \frac{1}{2}|$ is close to zero for most sets $y$ of cardinality $n$.

Theorem 5.2. Let $f$ be chosen at random on the $2n$ simplex. Let $S^-$ be the sum
of the $n$ smallest $f(x)$. Then for large $n$, the random variable

\[ \sqrt{2n}\Bigl(S^- - \frac{1 - \log 2}{2}\Bigr) \]

has an approximate normal distribution with mean 0 and variance $\frac{3}{2} - 2\log 2$.

Proof. Represent a randomly chosen $f$ as $\frac{X_i}{S}$ where $X_i$ are independent standard
exponential random variables and $S = \sum_{i=1}^{2n} X_i$. Denote the order statistics by
round brackets:

\[ X_{(1)} \le X_{(2)} \le \cdots \le X_{(2n)}. \]

Let $L_1 = X_{(1)}$, $L_2 = X_{(2)} - X_{(1)}, \ldots, L_{2n} = X_{(2n)} - X_{(2n-1)}$. Then the $L_i$ are
independent, and $L_{i+1}$ has the distribution of a standard exponential times $\frac{1}{2n-i}$;
see Feller ([23], Section III.3). With this notation,

\[ S = \sum_{i=1}^{2n} X_{(i)} = \sum_{i=0}^{2n-1}(2n - i)L_{i+1}, \tag{18} \]

\[ S^- = \frac{1}{S}\sum_{i=0}^{n-1}(n - i)L_{i+1}. \tag{19} \]

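The proof is completed by approximating the sums in this representation of $S$
and $S^-$. Let $\mu_i = \frac{n-i}{2n-i}$, so that $(n - i)L_{i+1}$ has the same distribution as $\mu_i$ times a
standard exponential. Let

\[ \sigma_2^2 = \frac{1}{n}\sum_{i=0}^{n-1}\mu_i^2 = \frac{1}{n}\sum_{i=0}^{n-1}\Bigl(1 - \frac{2n}{2n-i} + \frac{n^2}{(2n-i)^2}\Bigr) = \frac{1}{n}\Bigl(n - \bigl(2n\log 2 + O(1)\bigr) + \frac{n}{2} + O(1)\Bigr) = \frac{3}{2} - 2\log 2 + o(1). \]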



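Now let $Z_1$ and $Z_2$ denote the centered and scaled versions of $S$ and of the sum
$\sum_{i=0}^{n-1}(n - i)L_{i+1}$ appearing in (19). The vector $(Z_1, Z_2)$ has a limiting
bivariate normal distribution, with mean $(0, 0)$ and covariance matrix $\begin{pmatrix}\sigma_1^2 & \rho \\ \rho & \sigma_2^2\end{pmatrix}$ with
$\sigma_1^2 = 2$, $\sigma_2^2 = \frac{3}{2} - 2\log 2$, and $\rho = \frac{1}{2}(1 - \log 2)$. To check the value of $\rho$, observe that
the covariance of $Z_1$ and $Z_2$ is $\frac{1}{2n}$ times

\[ \sum_{i=0}^{n-1} E\Bigl\{\bigl[(2n - i)L_{i+1} - 1\bigr]\Bigl[(n - i)L_{i+1} - \frac{n-i}{2n-i}\Bigr]\Bigr\} = \sum_{i=0}^{n-1}\frac{n-i}{2n-i} = n - n\log 2 + O(1). \]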
Using the standard $O_p$ calculus



In particular, 



1 = ^ + ^^ 



The representation (19) for S" can be rewritten as 



X S ^2 2 V V2nJ ' \n 

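It follows that $\sqrt{2n}\bigl(S^- - \frac{1 - \log 2}{2}\bigr)$ has the same limiting distribution as $Z_2 -
\frac{1 - \log 2}{2}Z_1$. This is normal with mean 0 and variance

\[ \Bigl(\frac{3}{2} - 2\log 2\Bigr) + 2\Bigl(\frac{1 - \log 2}{2}\Bigr)^2 - 2\Bigl(\frac{1 - \log 2}{2}\Bigr)^2 = \frac{3}{2} - 2\log 2. \qquad \square \]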

Corollary 5.3. Let $f$ be chosen at random on the $2n$ simplex. Let $(y, y^c)$ be a
partition of $X$ into an $n$ set and its complement which maximizes the value of

\[ \Bigl|\bar f(y) - \frac{1}{2}\Bigr| + \Bigl|\bar f(y^c) - \frac{1}{2}\Bigr|. \]

Then, as $n$ tends to infinity, the maximum discrepancy tends to $\log 2 \approx 0.693$ (natural
logarithm) with probability tending to 1.

Proof. For almost all $f$, the maximum is taken on uniquely at the partition $S^-$,
$(S^-)^c$ defined in Theorem 5.2. The maximum discrepancy equals

\[ 2\Bigl|S^- - \frac{1}{2}\Bigr|, \]

and the result follows from Theorem 5.2. □
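
A small simulation illustrates the corollary. The sketch below draws $f$ uniformly from the $2n$ simplex through normalised exponentials, takes the $n$ smallest masses as the least uniform half, and reports the discrepancy $2|S^- - \frac{1}{2}|$, which should be near the natural logarithm of 2 for large $n$. The function name, sample size and seed are illustrative assumptions.

```python
import math
import random


def least_uniform_half_split(n, rng):
    """Discrepancy |f(y) - 1/2| + |f(y^c) - 1/2| when y holds the n smallest
    of 2n masses drawn uniformly from the simplex."""
    xs = sorted(rng.expovariate(1.0) for _ in range(2 * n))
    s_minus = sum(xs[:n]) / sum(xs)  # total mass of the n smallest points
    return 2.0 * abs(s_minus - 0.5)


d = least_uniform_half_split(5000, random.Random(1))
print(round(d, 3), round(math.log(2), 3))
```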

Remark. The proof of Theorem 5.2 and its corollary can easily be extended to 
cover the j sets of an n set. The argument shows that for most probabilities /, the 
variation distance between the least uniform projection and the uniform distribution 
is bounded away from zero if j is an appreciable fraction of n. 

For the final theorem, a different method of choosing a random probability is
introduced. Let $X$ be a set of cardinality $2n$. Fix an integer $b$. Drop $b$ balls into
$2n$ boxes, and let $f(x)$ be the proportion of balls in the box labeled $x$. Let $\mathcal{Y}$ be
the subsets of $X$ with cardinality $n$. Clearly, if $b$ is large with respect to $n$, $f(x)$
is approximately $\frac{1}{2n}$ and so for any $y \in \mathcal{Y}$, $\bar f(y) \doteq \frac{1}{2}$, even for the minimizing
$\bar f(y_*)$. At the other extreme, if $b$ is small with respect to $n$, $\bar f(y_*)$ will be close to
zero. For example, if $b = n$, $\bar f(y_*) = 0$. It will follow from Theorem 5.4 that $\bar f(y_*)$
is approximately zero for $b < 2n\log 2$.

This model for generating a random probability gives insight into the following
problem. If data is generated from a structureless model, random fluctuations may
produce structure that is picked up by a rich enough data analytic procedure.
As $b$ varies in the above model, the random probability converges to a uniform
distribution. The following theorem gives an indication of how large $b$ must be
for all projections to be close to uniform.

Table 15

λ                    1     2     3     4     5     6     7     8     9     10
Variation distance   0.74  0.54  0.44  0.40  0.36  0.32  0.30  0.28  0.26  0.24

Some required notation: For $\lambda > 0$, let
$p_\lambda(j) = \frac{e^{-\lambda}\lambda^j}{j!}$ denote the Poisson density. Let $P_\lambda(j) = \sum_{i=0}^{j} p_\lambda(i)$. Let $m$ be the
largest integer with $P_\lambda(m) < \frac{1}{2}$, $P_\lambda(m + 1) > \frac{1}{2}$. Define $\theta = \theta(\lambda)$ by

\[ P_\lambda(m) + \theta\,p_\lambda(m + 1) = \frac{1}{2}, \quad \text{so } 0 < \theta < 1. \]

When $\lambda$ is an integer, Ramanujan showed that

\[ \theta = \frac{1}{3} + O\Bigl(\frac{1}{\lambda}\Bigr) \quad \text{as } \lambda \to \infty. \]

See Cheng [9] for references and extensions of Ramanujan's results.

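Theorem 5.4. Suppose that $n$ and $b$ tend to infinity in such a way that $\frac{b}{2n} \to \lambda$.
Let $y_*$ be the $n$ set with smallest value of $\bar f(y_*)$. Then

\[ \bar f(y_*) \to \frac{1}{2} - \frac{e^{-\lambda}\lambda^m}{m!}\Bigl(1 - \theta + \frac{\theta\lambda}{m+1}\Bigr). \]

Remarks. For $\lambda < \log 2$ and $m = -1$, the variation distance can be shown to tend
to one. For large $\lambda$, $\frac{e^{-\lambda}\lambda^m}{m!}$ is roughly $\frac{1}{\sqrt{2\pi\lambda}}$; thus for large $\lambda$, the variation distance
tends to zero like $\frac{2}{\sqrt{2\pi\lambda}}$. This is not very rapid as Table 15 shows. (Note that for
integer $\lambda$, $m + 1 = \lambda$, so the asymptotic value of the variation distance is $2\frac{e^{-\lambda}\lambda^\lambda}{\lambda!}$.)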
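
For integer $\lambda$ the asymptotic variation distance $2e^{-\lambda}\lambda^\lambda/\lambda!$ is easy to tabulate. The sketch below evaluates it for $\lambda = 1, \ldots, 10$ and compares it with the printed entries of Table 15. The computed values agree with the table to within roughly 0.01, with a few entries differing in the second decimal. The function name is an illustrative assumption.

```python
from math import exp, factorial


def limiting_variation_distance(lam):
    """2 * exp(-lam) * lam**lam / lam! for a positive integer lam."""
    return 2.0 * exp(-lam) * lam ** lam / factorial(lam)


# Entries of Table 15 as printed.
table_15 = [0.74, 0.54, 0.44, 0.40, 0.36, 0.32, 0.30, 0.28, 0.26, 0.24]
computed = [limiting_variation_distance(lam) for lam in range(1, 11)]
max_gap = max(abs(v - t) for v, t in zip(computed, table_15))
print([round(v, 3) for v in computed], round(max_gap, 4))
```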



Proof. The argument will only be sketched. For $b$ and $n$ large, the number of balls
in the $i$th box has a limiting Poisson distribution with parameter $\lambda$, and different
boxes can be treated as independent. The arguments in Diaconis and Freedman
([15], Section 3) can be used to justify this step.

Thus let $X_1, X_2, \ldots$ be independent Poisson variables with mean $\lambda$. With
probability 1, eventually the median of $X_1, X_2, \ldots, X_{2n}$ is $m + 1$ and the proportion of
$X_i$, $1 \le i \le 2n$, equal to $j$ is $p_\lambda(j) + o(1)$ uniformly for $0 \le j \le m + 1$. Let $S^-$ be
the sum of the $n$ smallest $X_i$, $1 \le i \le 2n$. It follows that $\frac{S^-}{2n}$ equals

\[ 0\cdot p_\lambda(0) + p_\lambda(1) + \cdots + m\,p_\lambda(m) + \theta(m + 1)p_\lambda(m + 1) + o(1). \]

This sum equals

\[ \lambda\Bigl[\frac{1}{2} - \frac{e^{-\lambda}\lambda^m}{m!}\Bigl(1 - \theta + \frac{\theta\lambda}{m+1}\Bigr)\Bigr]. \]

The identity asserted in the theorem follows from noting that $\bar f(y_*)$ is the limiting
value of $\frac{S^-}{2\lambda n}$. □



Appendix: Automating the analysis 



In Section 2, we used the adjusted second order margins in a graphical, data analytic
fashion to seriate the books of Plato. For some purposes, it may be desirable to have
a more formal ranking procedure. We carry this out in Section A.1. The procedure
is based on a collection of metrics between probabilities. These are explained in
Section A.2. Finally, in Section A.3, we carry out a fully automated analysis of the
Plato data based on all affine projections, not just first and second order statistics.
We conclude that most methods agree, and suggest that the structures described in
Section 3 are robustly embedded in the Plato data. In this section, we have added
a seventh book, Criticus, to the analysis.

A.1. A metric approach

In our data analysis, the adjusted second order statistics emerged as an informative
summary of the rhyming patterns in Plato's Republic. As explained in Section 2,
this is a vector of ten numbers (one for each pair of the last five syllables, i.e.
$\binom{5}{2} = 10$). For the moment, call this vector $p^R = (p_1^R, \ldots, p_{10}^R)$ with "R" denoting
Republic. A similar ten-vector can be computed for each of the other books. We
may then use the distance between these vectors and $p^R$ to order the books. Books
closest to $p^R$ are ranked earlier. We also compute a ranking based on the distance
to $p^L$, the adjusted second order statistics for Plato's Laws. These two rankings
generally agree, and agree with the conclusions of Section 3.

To proceed, we need to choose a distance between vectors. We have examined
three standard distances between probability vectors: the Hellinger distance H, the
Total Variation distance TV, and the Vasserstein distance V. These are explained
more carefully in Section A.2. The rankings are given in Table 16: R denotes
Republic, L denotes Laws, and $\cdot$ denotes the row variable.

Almost the same seriation is obtained when any of the three metrics are used
to compute distances between Republic and the other books. Similarly, almost the
same seriation is obtained when any of the three metrics are used to compute
distances between Laws and the other books. Most clearly, Politicus is closest to
Laws and furthest from the Republic. Timaeus and Sophist, as a pair, are closest
to Republic and furthest from Laws. However, Sophist is closer than Timaeus both to
Laws and to Republic. From these calculations, aside from Politicus, Philebus
is closest to Laws and furthest from Republic. This is then followed by Criticus. All
of this points to the ordering: Republic, {Sophist, Timaeus}, Criticus, Philebus,
{Politicus, Laws}.

This ordering is consistent with the ordering produced data analytically in
Section 3 and with the ordering based on the exponential model of Cox and
Brandwood [11]. In Ahn et al. [1], a total of ten books were used for analysis. They found
"roughly three clusters" (618): {Tim., Soph., Crit., Pol., *}, {Laws, Phil.}, {Rep.,
*, *}. Here * denotes a book not analyzed in our work. Their final ordering based
on a cluster analysis using the Euclidean metric is Republic, Timaeus, Criticus,
Sophist, Politicus, Philebus, Laws.

Table 16
Ranking of book in row based on distance in column

Book    d_H(R,·)  d_TV(R,·)  d_V(R,·)  d_H(L,·)  d_TV(L,·)  d_V(L,·)
Tim.       2         2          2         5         5          5
Soph.      1         1          1         4         4          4
Pol.       6         5          6         1         1          1
Crit.      3         3          3         3         3          3
Phil.      4         4          4         2         2          2
Laws       5         6          5

A.2. Some metrics

Let $p = (p_1, \ldots, p_n)$, $q = (q_1, \ldots, q_n)$ be probability vectors. Thus $p_i \ge 0$ and $p_1 +
\cdots + p_n = 1$, and the same holds for $q$. Three widely used metrics are:

Total Variation: $d_{TV}(p, q) = \frac{1}{2}\sum_i |p_i - q_i|$.
Hellinger: $d_H(p, q) = \sum_i (\sqrt{p_i} - \sqrt{q_i})^2$.
Vasserstein: $d_V(p, q) = \min_{X,Y} E(d(X, Y))$,

where the minimum is over all joint distributions of $X$ and $Y$ with marginals $p$ and
$q$.

These metrics, their strengths, weaknesses and relations are discussed in Dudley 
[21], Villani [48] and Diaconis et al. [20]. 

In Section A.1, we used these metrics between vectors of positive entries which
did not necessarily have sum one. This was done by forming $\bar p = \sum_i p_i$, $\bar q = \sum_i q_i$,
$\tilde p = p/\bar p$, $\tilde q = q/\bar q$. We used the distance between $\tilde p$ and $\tilde q$ and added a penalty term to
account for differences in mass between the profiles $p$ and $q$. For total variation, the
penalty was $|\bar p - \bar q|$. We computed and compared two penalty terms for Hellinger:
both $|\bar p - \bar q|$ and $(\sqrt{\bar p} - \sqrt{\bar q})^2$.
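
The total variation and Hellinger formulas of this section, together with the mass penalty for unnormalised profiles, can be sketched in a few lines of Python. The function names are hypothetical, and the Hellinger distance is implemented exactly as written above, without a square root or a factor of one half.

```python
from math import sqrt


def total_variation(p, q):
    """d_TV(p, q) = (1/2) * sum_i |p_i - q_i|."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def hellinger(p, q):
    """d_H(p, q) = sum_i (sqrt(p_i) - sqrt(q_i))^2, as defined in the text."""
    return sum((sqrt(a) - sqrt(b)) ** 2 for a, b in zip(p, q))


def tv_with_mass_penalty(p, q):
    """Total variation between the normalised profiles plus |mass(p) - mass(q)|."""
    mass_p, mass_q = sum(p), sum(q)
    p_norm = [a / mass_p for a in p]
    q_norm = [b / mass_q for b in q]
    return total_variation(p_norm, q_norm) + abs(mass_p - mass_q)


def hellinger_with_mass_penalty(p, q):
    """Hellinger between the normalised profiles plus (sqrt(mass_p) - sqrt(mass_q))^2."""
    mass_p, mass_q = sum(p), sum(q)
    p_norm = [a / mass_p for a in p]
    q_norm = [b / mass_q for b in q]
    return hellinger(p_norm, q_norm) + (sqrt(mass_p) - sqrt(mass_q)) ** 2


print(total_variation([0.5, 0.5], [1.0, 0.0]), hellinger([0.5, 0.5], [1.0, 0.0]))
```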

Thus, the distances between the ten-vector of adjusted second order margins of
Republic and those of the other books, using Vasserstein, are given in Table 17.

For completeness, we note that the Vasserstein metric requires an underlying 
distance on a probability space; in our case, this amounts to an underlying distance 
between the ten entries in each table. We take these entries to be binary 5-tuples 
containing two ones. We use the distance between two of these as the minimum 
number of pairwise adjacent switches required to bring one to the other. Thus the 
distance between 11000 and 00011 is 6. Further background can be found in Diaco- 
nis et al. [20] or Thompson [46]. With this choice specified, the minimization prob- 
lem is equivalent to the Monge-Kantorovich Transshipment problem. We computed 
distances using the CS-2 code of Andrew Goldberg (www/avglab.org/andrew). 
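
The underlying distance between binary 5-tuples can be sketched as follows. For two binary strings with the same number of ones, the minimum number of adjacent switches equals the sum of the displacements of the ones when they are matched in order. The function name is hypothetical, and the sketch covers only the ground distance, not the transshipment solver.

```python
def adjacent_switch_distance(u, v):
    """Minimum number of adjacent transpositions turning binary string u into v.

    Both strings must have the same length and the same number of ones.
    """
    ones_u = [i for i, ch in enumerate(u) if ch == "1"]
    ones_v = [i for i, ch in enumerate(v) if ch == "1"]
    if len(u) != len(v) or len(ones_u) != len(ones_v):
        raise ValueError("strings must have equal length and equal weight")
    # Matching the k-th one of u to the k-th one of v is optimal.
    return sum(abs(a - b) for a, b in zip(ones_u, ones_v))


print(adjacent_switch_distance("11000", "00011"))
```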

A.3. Using all affine projections

The data analysis of Section 2 used projections into first and second order margins. 
The general theory developed later points to all affine projections as a natural base 
for analysis. In this section, we complete our analysis of the Plato data by looking 
at all affine projections. 

In the following, $x$ and $z$ range over all binary 5-tuples. If $f(x)$ is the proportion
of sentences in a fixed book (e.g. Republic) with rhyming pattern $x$, the projection
of $f$ in direction $z$ is

\[ \sum_{x \cdot z = 0} f(x), \qquad \sum_{x \cdot z = 1} f(x). \]
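
A minimal sketch of this projection, assuming that $x \cdot z$ denotes the dot product modulo 2 and that $f$ is stored as a dictionary keyed by binary 5-tuples. The function name and the two example distributions are illustrative only.

```python
from itertools import product


def affine_projection(f, z):
    """Return (sum of f over x.z = 0, sum of f over x.z = 1), dot product mod 2."""
    totals = [0.0, 0.0]
    for x, mass in f.items():
        totals[sum(a * b for a, b in zip(x, z)) % 2] += mass
    return tuple(totals)


patterns = list(product((0, 1), repeat=5))
uniform = {x: 1 / 32 for x in patterns}        # uniform distribution on Z_2^5
point_mass = {x: 0.0 for x in patterns}
point_mass[(0, 0, 0, 1, 0)] = 1.0              # all mass on a single pattern

print(affine_projection(uniform, (0, 0, 0, 1, 0)))
print(affine_projection(point_mass, (0, 0, 0, 1, 0)))
```

For the uniform distribution every direction gives an even split, which is the uninformative case. The point mass shows the opposite extreme.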



Table 17
$d_V$ for Republic to other books

Book    Vass. Dist.  Mass diff  Total  Rank
Laws       109          951      1060    5
Phil.      119          748       867    4
Pol.       112          952      1064    6
Soph.       82           97       179    1
Tim.        41          263       304    2
Crit.       71          675       746    3

Table 18

Book    (00010)  (01100)  (11000)
Rep.       1        2        1
Tim.       2        3        2
Soph.      3        4        4
Pol.       4        5        7
Phil.      7        6        3
Laws       6        7        5
Crit.      5        1        6



To use the information that Republic was written early and Laws was written 
late, we find 5-tuples, z, that maximize 



where g(x) codes patterns for Laws. The largest three differences occur at z = 
(00010), (01100) and (11000). For each of these, we calculated 



for each of the books (where $h$ codes the patterns for a particular book), and
used the linear order of these values to order the books. The rank orders resulting
from the three binary 5-tuples $z$ with the largest differences above, $z =
(00010)$, $(01100)$ and $(11000)$, are given in Table 18.

The first column thus gives the ranking: Rep., Tim., Soph., Pol., Crit., Laws, Phil.
This is based on the difference at a single syllable (second from the end). It is
close to, but not the same as, the ranking based on adjusted second order margins
found above. The other columns differ and show that not 'any old' projection gives
the same ranking.

Acknowledgments. This paper is written in tribute to David Freedman with
thanks for his integrity and brilliance.